{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗 Welcome to AdalFlow!\n",
    "## The PyTorch library to auto-optimize any LLM task pipelines\n",
    "\n",
    "Thanks for trying us out, we're here to provide you with the best LLM application development experience you can dream of 😊 any questions or concerns you may have, [come talk to us on discord,](https://discord.gg/ezzszrRZvT) we're always here to help! ⭐ <i>Star us on <a href=\"https://github.com/SylphAI-Inc/AdalFlow\">Github</a> </i> ⭐\n",
    "\n",
    "\n",
    "# Quick Links\n",
    "\n",
    "Github repo: https://github.com/SylphAI-Inc/AdalFlow\n",
    "\n",
    "Full Tutorials: https://adalflow.sylph.ai/index.html#.\n",
    "\n",
    "Deep dive on each API: check out the [developer notes](https://adalflow.sylph.ai/tutorials/index.html).\n",
    "\n",
    "Common use cases along with the auto-optimization:  check out [Use cases](https://adalflow.sylph.ai/use_cases/index.html).\n",
    "\n",
    "# Outline\n",
    "This is the colab complementary to:\n",
    "* [LLM evaluation guideline](https://adalflow.sylph.ai/tutorials/evaluation.html)\n",
    "* [Source code](https://github.com/SylphAI-Inc/AdalFlow/tree/main/tutorials/evaluation)\n",
    "\n",
    "\n",
    "Introducing LLM evaluations with a focus on the generative tasks instead of classical Natural language understanding tasks.\n",
    "\n",
    "* Natural language Generation(NLG) metrics\n",
    "* RAG evaluation:\n",
    "    * RAG AnswerMatch\n",
    "    * RAG Retriever Recall\n",
    "\n",
    "\n",
    "\n",
    "# Installation\n",
    "\n",
    "1. Use `pip` to install the `adalflow` Python package. We will need `openai`, `groq`, and `faiss`(cpu version) from the extra packages.\n",
    "\n",
    "  ```bash\n",
    "  pip install adalflow[openai,groq,faiss-cpu]\n",
    "  ```\n",
    "2. Setup  `openai` and `groq` API key in the environment variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ensure version >= v0.2.1\n",
    "from IPython.display import clear_output\n",
    "\n",
    "!pip install -U adalflow[openai]\n",
    "\n",
    "clear_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Environment Variables\n",
    "\n",
    "Run the following code and pass your api key.\n",
    "\n",
    "Note: for normal `.py` projects, follow our [official installation guide](https://lightrag.sylph.ai/get_started/installation.html).\n",
    "\n",
    "*Go to [OpenAI](https://platform.openai.com/docs/introduction) and [Groq](https://console.groq.com/docs/) to get API keys if you don't already have.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "API keys have been set.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "from getpass import getpass\n",
    "\n",
    "# Prompt user to enter their API keys securely\n",
    "openai_api_key = getpass(\"Please enter your OpenAI API key: \")\n",
    "\n",
    "\n",
    "# Set environment variables\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
    "\n",
    "print(\"API keys have been set.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 😇 Classical Text metrics and issues\n",
    "\n",
    "We will use `Torchmetrics` to compute the classical text metrics like BLEU, ROUGE.\n",
    "\n",
    "We choose a case where the ground truth(references) means the same as the generated text, but where BLEU and ROUGE are not able to capture the similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install torchmetrics\n",
    "\n",
    "clear_output()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "gt = \"Brazil has won 5 FIFA World Cup titles\"\n",
    "pred = \"Brazil is the five-time champion of the FIFA WorldCup.\"\n",
    "\n",
    "\n",
    "def compute_rouge(gt, pred):\n",
    "    r\"\"\"\n",
    "    https://lightning.ai/docs/torchmetrics/stable/text/rouge_score.html\n",
    "    \"\"\"\n",
    "    from torchmetrics.text.rouge import ROUGEScore\n",
    "\n",
    "    rouge = ROUGEScore()\n",
    "    return rouge(pred, gt)\n",
    "\n",
    "\n",
    "def compute_bleu(gt, pred):\n",
    "    r\"\"\"\n",
    "    https://lightning.ai/docs/torchmetrics/stable/text/bleu_score.html\n",
    "    \"\"\"\n",
    "    from torchmetrics.text.bleu import BLEUScore\n",
    "\n",
    "    bleu = BLEUScore()\n",
    "    # preds = [\"the cat is on the mat\"]\n",
    "    # target = [[\"there is a cat on the mat\", \"a cat is on the mat\"]]\n",
    "    # score = bleu(preds, target)\n",
    "    # print(f\"score: {score}\")\n",
    "    # print(f\"pred: {[pred]}, gt: {[[gt]]}\")\n",
    "    return bleu([pred], [[gt]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'rouge1_fmeasure': tensor(0.2222),\n",
       " 'rouge1_precision': tensor(0.2000),\n",
       " 'rouge1_recall': tensor(0.2500),\n",
       " 'rouge2_fmeasure': tensor(0.),\n",
       " 'rouge2_precision': tensor(0.),\n",
       " 'rouge2_recall': tensor(0.),\n",
       " 'rougeL_fmeasure': tensor(0.2222),\n",
       " 'rougeL_precision': tensor(0.2000),\n",
       " 'rougeL_recall': tensor(0.2500),\n",
       " 'rougeLsum_fmeasure': tensor(0.2222),\n",
       " 'rougeLsum_precision': tensor(0.2000),\n",
       " 'rougeLsum_recall': tensor(0.2500)}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_rouge(gt, pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor(0.)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_bleu(gt, pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗  Embedding-based Metrics -- BERTScore\n",
    "\n",
    "To make up for this, embedding-based  metrics or neural evaluators such as BERTScore was created.\n",
    "You can find BERTScore in both `Hugging Face Metrics <https://huggingface.co/metrics>`_ and `TorchMetrics <https://lightning.ai/docs/torchmetrics/stable/text/bertscore.html>`_.\n",
    "BERTScore uses pre-trained contextual embeddings from BERT and matched words in generated text and reference text using cosine similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_bertscore(gt, pred):\n",
    "    r\"\"\"\n",
    "    https://lightning.ai/docs/torchmetrics/stable/text/bert_score.html\n",
    "    \"\"\"\n",
    "    from torchmetrics.text.bert import BERTScore\n",
    "\n",
    "    bertscore = BERTScore()\n",
    "    return bertscore([pred], [gt])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/liyin/Documents/test/LightRAG/.venv/lib/python3.12/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: The argument `model_name_or_path` was not specified while it is required when the default `transformers` model is used. It will use the default recommended model - 'roberta-large'.\n",
      "  warnings.warn(*args, **kwargs)  # noqa: B028\n",
      "/Users/liyin/Documents/test/LightRAG/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'precision': tensor(0.9752), 'recall': tensor(0.9827), 'f1': tensor(0.9789)}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_bertscore(gt, pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤗  LLM As Judge\n",
    "\n",
    "AdalFlow provides a very customizable LLM judge, which can be used in three ways:\n",
    "\n",
    "1. With question, ground truth, and generated text\n",
    "2. Without question, with ground truth, and generated text\n",
    "3. Without question, without ground truth, with generated text\n",
    "\n",
    "And you can customize the `judgement_query` towards your use case or even the whole llm template.\n",
    "\n",
    "AdalFlow LLM judge returns `LLMJudgeEvalResult` which has three fields:\n",
    "1. `avg_score`: average score of the generated text\n",
    "2. `judgement_score_list`: list of scores for each generated text\n",
    "3. `confidence_interval`: a tuple of the 95% confidence interval of the scores\n",
    "\n",
    "\n",
    "`DefaultLLMJudge` is an LLM task pipeline that takes a single question(optional), ground truth(optional), and generated text and returns the float score in range [0,1].\n",
    "\n",
    "You can use it as an `eval_fn` for AdalFlow Trainer.\n",
    "\n",
    "`LLMAsJudge` is an evaluator that takes a list of inputs and returns a list of `LLMJudgeEvalResult`.\n",
    "Besides of the score, it computes the confidence interval of the scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# without questions, and with customized judgement query\n",
    "\n",
    "\n",
    "def compute_llm_as_judge_wo_questions():\n",
    "    from adalflow.eval.llm_as_judge import LLMasJudge, DefaultLLMJudge\n",
    "    from adalflow.components.model_client import OpenAIClient\n",
    "\n",
    "    llm_judge = DefaultLLMJudge(\n",
    "        model_client=OpenAIClient(),\n",
    "        model_kwargs={\n",
    "            \"model\": \"gpt-4o\",\n",
    "            \"temperature\": 1.0,\n",
    "            \"max_tokens\": 10,\n",
    "        },\n",
    "        jugement_query=\"Does the predicted answer means the same as the ground truth answer? Say True if yes, False if no.\",\n",
    "    )\n",
    "    llm_evaluator = LLMasJudge(llm_judge=llm_judge)\n",
    "    print(llm_judge)\n",
    "    eval_rslt = llm_evaluator.compute(gt_answers=[gt], pred_answers=[pred])\n",
    "    print(eval_rslt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DefaultLLMJudge(\n",
      "  judgement_query= Does the predicted answer means the same as the ground truth answer? Say True if yes, False if no., \n",
      "  (model_client): OpenAIClient()\n",
      "  (llm_evaluator): Generator(\n",
      "    model_kwargs={'model': 'gpt-4o', 'temperature': 1.0, 'max_tokens': 10}, trainable_prompt_kwargs=['task_desc_str', 'examples_str']\n",
      "    (prompt): Prompt(\n",
      "      template: <START_OF_SYSTEM_PROMPT>\n",
      "      {# task desc #}\n",
      "      {{task_desc_str}}\n",
      "      {# examples #}\n",
      "      {% if examples_str %}\n",
      "      {{examples_str}}\n",
      "      {% endif %}\n",
      "      <END_OF_SYSTEM_PROMPT>\n",
      "      ---------------------\n",
      "      <START_OF_USER>\n",
      "      {# question #}\n",
      "      {% if question_str is defined %}\n",
      "      Question: {{question_str}}\n",
      "      {% endif %}\n",
      "      {# ground truth answer #}\n",
      "      {% if gt_answer_str is defined %}\n",
      "      Ground truth answer: {{gt_answer_str}}\n",
      "      {% endif %}\n",
      "      {# predicted answer #}\n",
      "      Predicted answer: {{pred_answer_str}}\n",
      "      <END_OF_USER>\n",
      "      , prompt_kwargs: {'task_desc_str': 'You are an evaluator. Given the question(optional), ground truth answer(optional), and predicted answer, Does the predicted answer means the same as the ground truth answer? Say True if yes, False if no.', 'examples_str': None}, prompt_variables: ['task_desc_str', 'examples_str', 'pred_answer_str', 'question_str', 'gt_answer_str']\n",
      "    )\n",
      "    (model_client): OpenAIClient()\n",
      "  )\n",
      ")\n",
      "true\n",
      "LLMJudgeEvalResult(avg_score=1.0, judgement_score_list=[1], confidence_interval=(0, 1))\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/liyin/Documents/test/LightRAG/.venv/lib/python3.12/site-packages/numpy/core/_methods.py:206: RuntimeWarning: Degrees of freedom <= 0 for slice\n",
      "  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,\n",
      "/Users/liyin/Documents/test/LightRAG/.venv/lib/python3.12/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide\n",
      "  ret = ret.dtype.type(ret / rcount)\n"
     ]
    }
   ],
   "source": [
    "compute_llm_as_judge_wo_questions()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# with questions and default judgement query\n",
    "def compute_llm_as_judge():\n",
    "    from adalflow.eval.llm_as_judge import LLMasJudge, DefaultLLMJudge\n",
    "    from adalflow.components.model_client import OpenAIClient\n",
    "\n",
    "    questions = [\n",
    "        \"Is Beijing in China?\",\n",
    "        \"Is Apple founded before Google?\",\n",
    "        \"Is earth flat?\",\n",
    "    ]\n",
    "    pred_answers = [\"Yes\", \"Yes, Appled is founded before Google\", \"Yes\"]\n",
    "    gt_answers = [\"Yes\", \"Yes\", \"No\"]\n",
    "\n",
    "    llm_judge = DefaultLLMJudge(\n",
    "        model_client=OpenAIClient(),\n",
    "        model_kwargs={\n",
    "            \"model\": \"gpt-4o\",\n",
    "            \"temperature\": 1.0,\n",
    "            \"max_tokens\": 10,\n",
    "        },\n",
    "    )\n",
    "    llm_evaluator = LLMasJudge(llm_judge=llm_judge)\n",
    "    print(llm_judge)\n",
    "    eval_rslt = llm_evaluator.compute(\n",
    "        questions=questions, gt_answers=gt_answers, pred_answers=pred_answers\n",
    "    )\n",
    "    print(eval_rslt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DefaultLLMJudge(\n",
      "  judgement_query= Does the predicted answer contain the ground truth answer? Say True if yes, False if no., \n",
      "  (model_client): OpenAIClient()\n",
      "  (llm_evaluator): Generator(\n",
      "    model_kwargs={'model': 'gpt-4o', 'temperature': 1.0, 'max_tokens': 10}, trainable_prompt_kwargs=['task_desc_str', 'examples_str']\n",
      "    (prompt): Prompt(\n",
      "      template: <START_OF_SYSTEM_PROMPT>\n",
      "      {# task desc #}\n",
      "      {{task_desc_str}}\n",
      "      {# examples #}\n",
      "      {% if examples_str %}\n",
      "      {{examples_str}}\n",
      "      {% endif %}\n",
      "      <END_OF_SYSTEM_PROMPT>\n",
      "      ---------------------\n",
      "      <START_OF_USER>\n",
      "      {# question #}\n",
      "      {% if question_str is defined %}\n",
      "      Question: {{question_str}}\n",
      "      {% endif %}\n",
      "      {# ground truth answer #}\n",
      "      {% if gt_answer_str is defined %}\n",
      "      Ground truth answer: {{gt_answer_str}}\n",
      "      {% endif %}\n",
      "      {# predicted answer #}\n",
      "      Predicted answer: {{pred_answer_str}}\n",
      "      <END_OF_USER>\n",
      "      , prompt_kwargs: {'task_desc_str': 'You are an evaluator. Given the question(optional), ground truth answer(optional), and predicted answer, Does the predicted answer contain the ground truth answer? Say True if yes, False if no.', 'examples_str': None}, prompt_variables: ['task_desc_str', 'examples_str', 'pred_answer_str', 'question_str', 'gt_answer_str']\n",
      "    )\n",
      "    (model_client): OpenAIClient()\n",
      "  )\n",
      ")\n",
      "true\n",
      "true\n",
      "false\n",
      "LLMJudgeEvalResult(avg_score=0.6666666666666666, judgement_score_list=[1, 1, 0], confidence_interval=(0.013333333333333197, 1))\n"
     ]
    }
   ],
   "source": [
    "compute_llm_as_judge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🤩 G-eval\n",
    "\n",
    "If you have no reference text, you can also use G-eval [11]_ to evaluate the generated text on the fly.\n",
    "G-eval provided a way to evaluate:\n",
    "\n",
    "- `relevance`: evaluates how relevant the summarized text to the source text.\n",
    "- `fluency`: the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\n",
    "- `consistency`: evaluates the collective quality of all sentences.\n",
    "- `coherence`: evaluates the the factual alignment between the summary and the summarized source.\n",
    "\n",
    "In our library, we provides the prompt for task `Summarization` and `Chatbot` as default.\n",
    "We also map the score to the range [0, 1] for the ease of optimization.\n",
    "\n",
    "Here is the code snippet to compute the G-eval score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_g_eval_summarization(source, summary):\n",
    "    from adalflow.eval.g_eval import GEvalLLMJudge, GEvalJudgeEvaluator, NLGTask\n",
    "\n",
    "    model_kwargs = {\n",
    "        \"model\": \"gpt-4o\",\n",
    "        \"n\": 20,\n",
    "        \"top_p\": 1,\n",
    "        \"max_tokens\": 5,\n",
    "        \"temperature\": 1,\n",
    "    }\n",
    "\n",
    "    g_eval = GEvalLLMJudge(\n",
    "        default_task=NLGTask.SUMMARIZATION, model_kwargs=model_kwargs\n",
    "    )\n",
    "    print(g_eval)\n",
    "    input_template = \"\"\"Source Document: {source}\n",
    "    Summary: {summary}\n",
    "    \"\"\"\n",
    "\n",
    "    input_str = input_template.format(\n",
    "        source=source,\n",
    "        summary=summary,\n",
    "    )\n",
    "\n",
    "    g_evaluator = GEvalJudgeEvaluator(llm_judge=g_eval)\n",
    "\n",
    "    response = g_evaluator(input_strs=[input_str])\n",
    "    print(f\"response: {response}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GEvalLLMJudge(\n",
      "  default_task= NLGTask.SUMMARIZATION, prompt_kwargs={'Relevance': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Relevance (1-5) - selection of important content from the source.\\n        The summary should include only important information from the source document.\\n        Annotators were instructed to penalize summaries which contained redundancies and excess information.', 'evaluation_steps_str': '1. Read the summary and the source document carefully.\\n        2. Compare the summary to the source document and identify the main points of the article.\\n        3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n        4. Assign a relevance score from 1 to 5.', 'metric_name': 'Relevance'}, 'Fluency': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Fluency (1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n        - 1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n        - 2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n        - 3: Good. The summary has few or no errors and is easy to read and follow.\\n        ', 'evaluation_steps_str': None, 'metric_name': 'Fluency'}, 'Consistency': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Consistency (1-5) - the factual alignment between the summary and the summarized source.\\n        A factually consistent summary contains only statements that are entailed by the source document.\\n        Annotators were also asked to penalize summaries that contained hallucinated facts. ', 'evaluation_steps_str': '1. Read the summary and the source document carefully.\\n        2. Identify the main facts and details it presents.\\n        3. Read the summary and compare it to the source document to identify any inconsistencies or factual errors that are not supported by the source.\\n        4. Assign a score for consistency based on the Evaluation Criteria.', 'metric_name': 'Consistency'}, 'Coherence': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Coherence (1-5) - the collective quality of all sentences.\\n        We align this dimension with the DUC quality question of structure and coherence whereby \"the summary should be well-structured and well-organized.\\n        The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.', 'evaluation_steps_str': '1. Read the input text carefully and identify the main topic and key points.\\n        2. Read the summary and assess how well it captures the main topic and key points. And if it presents them in a clear and logical order.\\n        3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.', 'metric_name': 'Coherence'}}\n",
      "  (model_client): OpenAIClient()\n",
      "  (llm_evaluator): Generator(\n",
      "    model_kwargs={'model': 'gpt-4o', 'n': 20, 'top_p': 1, 'max_tokens': 5, 'temperature': 1}, trainable_prompt_kwargs=[]\n",
      "    (prompt): Prompt(\n",
      "      template: \n",
      "      <START_OF_SYSTEM_PROMPT>\n",
      "      {# task desc #}\n",
      "      {{task_desc_str}}\n",
      "      ---------------------\n",
      "      {# evaluation criteria #}\n",
      "      Evaluation Criteria:\n",
      "      {{evaluation_criteria_str}}\n",
      "      ---------------------\n",
      "      {# evaluation steps #}\n",
      "      {% if evaluation_steps_str %}\n",
      "      Evaluation Steps:\n",
      "      {{evaluation_steps_str}}\n",
      "      ---------------------\n",
      "      {% endif %}\n",
      "      {{input_str}}\n",
      "      { # evaluation form #}\n",
      "      Output the score for metric (scores ONLY!): {{metric_name}}\n",
      "      <END_OF_SYSTEM_PROMPT>\n",
      "      , prompt_variables: ['input_str', 'metric_name', 'evaluation_criteria_str', 'task_desc_str', 'evaluation_steps_str']\n",
      "    )\n",
      "    (model_client): OpenAIClient()\n",
      "    (output_processors): FloatParser()\n",
      "  )\n",
      ")\n",
      "response: ({'Relevance': 0.4, 'Fluency': 0.6666666666666666, 'Consistency': 0.6, 'Coherence': 0.6, 'overall': 0.5666666666666667}, [{'Relevance': 0.4, 'Fluency': 0.6666666666666666, 'Consistency': 0.6, 'Coherence': 0.6, 'overall': 0.5666666666666667}])\n"
     ]
    }
   ],
   "source": [
    "source = (\n",
    "    \"Paul Merson has restarted his row with Andros Townsend after the Tottenham midfielder was brought on with only seven minutes remaining in his team 's 0-0 draw with Burnley on Sunday . 'Just been watching the game , did you miss the coach ? # RubberDub # 7minutes , ' Merson put on Twitter . Merson initially angered Townsend for writing in his Sky Sports column that 'if Andros Townsend can get in ( the England team ) then it opens it up to anybody . ' Paul Merson had another dig at Andros Townsend after his appearance for Tottenham against Burnley Townsend was brought on in the 83rd minute for Tottenham as they drew 0-0 against Burnley Andros Townsend scores England 's equaliser in their 1-1 friendly draw with Italy in Turin on Tuesday night The former Arsenal man was proven wrong when Townsend hit a stunning equaliser for England against Italy and he duly admitted his mistake . 'It 's not as though I was watching hoping he would n't score for England , I 'm genuinely pleased for him and fair play to him \\u00e2\\u20ac\\u201c it was a great goal , ' Merson said . 'It 's just a matter of opinion , and my opinion was that he got pulled off after half an hour at Manchester United in front of Roy Hodgson , so he should n't have been in the squad . 'When I 'm wrong , I hold my hands up . I do n't have a problem with doing that - I 'll always be the first to admit when I 'm wrong . ' Townsend hit back at Merson on Twitter after scoring for England against Italy Sky Sports pundit Merson ( centre ) criticised Townsend 's call-up to the England squad last week Townsend hit back at Merson after netting for England in Turin on Wednesday , saying 'Not bad for a player that should be 'nowhere near the squad ' ay @ PaulMerse ? ' Any bad feeling between the pair seemed to have passed but Merson was unable to resist having another dig at Townsend after Tottenham drew at Turf Moor .\",\n",
    ")\n",
    "summary = (\n",
    "    \"Paul merson was brought on with only seven minutes remaining in his team 's 0-0 draw with burnley . Andros townsend scored the tottenham midfielder in the 89th minute . Paul merson had another dig at andros townsend after his appearance . The midfielder had been brought on to the england squad last week . Click here for all the latest arsenal news news .\",\n",
    ")\n",
    "\n",
    "compute_g_eval_summarization(source=source, summary=summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GEvalLLMJudge(\n",
      "  default_task= NLGTask.SUMMARIZATION, prompt_kwargs={'Relevance': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Relevance (1-5) - selection of important content from the source.\\n        The summary should include only important information from the source document.\\n        Annotators were instructed to penalize summaries which contained redundancies and excess information.', 'evaluation_steps_str': '1. Read the summary and the source document carefully.\\n        2. Compare the summary to the source document and identify the main points of the article.\\n        3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.\\n        4. Assign a relevance score from 1 to 5.', 'metric_name': 'Relevance'}, 'Fluency': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Fluency (1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.\\n        - 1: Poor. The summary has many errors that make it hard to understand or sound unnatural.\\n        - 2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.\\n        - 3: Good. The summary has few or no errors and is easy to read and follow.\\n        ', 'evaluation_steps_str': None, 'metric_name': 'Fluency'}, 'Consistency': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Consistency (1-5) - the factual alignment between the summary and the summarized source.\\n        A factually consistent summary contains only statements that are entailed by the source document.\\n        Annotators were also asked to penalize summaries that contained hallucinated facts. ', 'evaluation_steps_str': '1. Read the summary and the source document carefully.\\n        2. Identify the main facts and details it presents.\\n        3. Read the summary and compare it to the source document to identify any inconsistencies or factual errors that are not supported by the source.\\n        4. Assign a score for consistency based on the Evaluation Criteria.', 'metric_name': 'Consistency'}, 'Coherence': {'task_desc_str': 'You will be given a summary of a text.  Please evaluate the summary based on the following criteria:', 'evaluation_criteria_str': 'Coherence (1-5) - the collective quality of all sentences.\\n        We align this dimension with the DUC quality question of structure and coherence whereby \"the summary should be well-structured and well-organized.\\n        The summary should not just be a heap of related information, but should build from sentence to a coherent body of information about a topic.', 'evaluation_steps_str': '1. Read the input text carefully and identify the main topic and key points.\\n        2. Read the summary and assess how well it captures the main topic and key points. And if it presents them in a clear and logical order.\\n        3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.', 'metric_name': 'Coherence'}}\n",
      "  (model_client): OpenAIClient()\n",
      "  (llm_evaluator): Generator(\n",
      "    model_kwargs={'model': 'gpt-4o', 'n': 20, 'top_p': 1, 'max_tokens': 5, 'temperature': 1}, trainable_prompt_kwargs=[]\n",
      "    (prompt): Prompt(\n",
      "      template: \n",
      "      <START_OF_SYSTEM_PROMPT>\n",
      "      {# task desc #}\n",
      "      {{task_desc_str}}\n",
      "      ---------------------\n",
      "      {# evaluation criteria #}\n",
      "      Evaluation Criteria:\n",
      "      {{evaluation_criteria_str}}\n",
      "      ---------------------\n",
      "      {# evaluation steps #}\n",
      "      {% if evaluation_steps_str %}\n",
      "      Evaluation Steps:\n",
      "      {{evaluation_steps_str}}\n",
      "      ---------------------\n",
      "      {% endif %}\n",
      "      {{input_str}}\n",
      "      { # evaluation form #}\n",
      "      Output the score for metric (scores ONLY!): {{metric_name}}\n",
      "      <END_OF_SYSTEM_PROMPT>\n",
      "      , prompt_variables: ['input_str', 'metric_name', 'evaluation_criteria_str', 'task_desc_str', 'evaluation_steps_str']\n",
      "    )\n",
      "    (model_client): OpenAIClient()\n",
      "    (output_processors): FloatParser()\n",
      "  )\n",
      ")\n",
      "response: ({'Relevance': 1.0, 'Fluency': 1.0, 'Consistency': 1.0, 'Coherence': 0.8, 'overall': 0.95}, [{'Relevance': 1.0, 'Fluency': 1.0, 'Consistency': 1.0, 'Coherence': 0.8, 'overall': 0.95}])\n"
     ]
    }
   ],
   "source": [
    "compute_g_eval_summarization(source=gt, summary=pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Issues and feedback\n",
    "\n",
    "If you encounter any issues, please report them here: [GitHub Issues](https://github.com/SylphAI-Inc/LightRAG/issues).\n",
    "\n",
    "For feedback, you can use either the [GitHub discussions](https://github.com/SylphAI-Inc/LightRAG/discussions) or [Discord](https://discord.gg/ezzszrRZvT)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
